For the past two years, the coronavirus pandemic has been a persistent source of concern. Two questions have arisen repeatedly throughout the pandemic: whether certain geographical regions are more prone to outbreaks than others, and whether government interventions to reduce coronavirus infections were effective. In this paper, we use Multiple Linear Regression and Bayesian Inference to address both questions. We then explore the differences among regions and among government intervention levels in more detail, using techniques such as Tukey’s HSD and Likelihood Ratio tests. Finally, we discuss whether causal inference can be made directly on the factors of interest.
The data were obtained from the World Health Organization (WHO) and the COVID19 Data Hub. To examine whether government intervention effectively reduced the number of coronavirus cases, we selected four variables representing different forms of government intervention or restriction. Our interest lies in the effects of these variables on the number of new cases. The indicator variables are presented below, each with a brief description of its contents and coding:
facial_coverings: An indicator variable pertaining to the severity of the facial-covering requirements, for the country of the observation.
Variable Coding:
0 - No policy
1 - Recommended
2 - Required in some public spaces outside the home with other people present, or when social distancing is not possible.
3 - Required in all public spaces, with people present.
4 - Required in all public spaces, regardless of whether people are present.
school_closing: An indicator variable pertaining to the severity of the (temporary) shut down of in-person schools.
Variable Coding:
0 - No restrictions.
1 - Recommended.
2 - Required for some grade levels.
3 - Required for all grade levels.
gatherings_restrictions: An indicator variable pertaining to the restrictions imposed on public gatherings, for the observation’s country.
Variable Coding:
0 - No restrictions
1 - Restrictions for very large gatherings.
2 - Restrictions for large gatherings.
3 - Restrictions for medium-sized gatherings.
4 - Restrictions for small gatherings.
cancel_events: An indicator variable pertaining to the restrictions imposed on local events.
Variable Coding:
0 - No restrictions.
1 - Recommended to cancel the event.
2 - Required to cancel the event.
These variables were selected from among the available government restriction indicators because they were among the most controversial restrictions adopted by governments, and because their effectiveness was under constant scrutiny. Next, we used the following variables from the WHO’s COVID19 Data bank:
WHO_region: The region of the observation’s occurrence.
Regions:
Africa
Americas
Eastern Mediterranean
Europe
South-East Asia
Western Pacific
New_cases: The number of newly reported cases.
New_deaths: The number of newly reported deaths resulting from coronavirus, in the past seven days.
In these data, it is worth noting that the New_cases variable contains negative-valued observations. The reason for this anomaly is that corrections were submitted to the data to account for false positive test results. Additionally, some values of the indicator variables were coded with a negative sign. A negative code indicates that the value is a best guess of the policy enforced in the observation’s country. For example, a \(-1\) in the cancel_events column implies that the observation’s country likely recommended the cancellation of events.
For the purposes of this paper, we produce an “ANCOVA-styled” multiple regression model, treating New_cases as our response variable, with WHO_region, facial_coverings, school_closing, gatherings_restrictions, and cancel_events as factors, and New_deaths as a covariate. To ensure the validity of our results, we exclude any observation in which an indicator variable carries a negative code.
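The exclusion of negative-coded observations can be sketched as follows. This is only an illustration on invented toy data: the data frame names `covid_raw` and `covid` and all values are assumptions, not the data or code used for this paper.

```r
# Toy data; every value here is invented for illustration only.
covid_raw <- data.frame(
  New_cases               = c(120, 45, 300, 10, 75),
  facial_coverings        = c(2, -1, 3, 0, 4),
  school_closing          = c(1, 2, -3, 0, 3),
  gatherings_restrictions = c(0, 4, 2, 1, -2),
  cancel_events           = c(1, 2, 0, 0, 2)
)

policy_vars <- c("facial_coverings", "school_closing",
                 "gatherings_restrictions", "cancel_events")

# Keep a row only if none of its policy indicators is negative
# (a negative code marks a best-guess policy value).
keep  <- rowSums(covid_raw[policy_vars] < 0) == 0
covid <- covid_raw[keep, ]

# Treat the surviving codes as categorical factors for the regression.
covid[policy_vars] <- lapply(covid[policy_vars], factor)

print(covid)
```

Filtering on the whole row, rather than recoding the negative values, keeps only observations whose policy levels are known with certainty.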
We argue that it is more efficient to compare regions rather than countries, because:
Countries within the same region are likely to experience more international travel from their neighboring countries, which could lead to spatial dependencies in the number of coronavirus cases.
Any model comparing regions will have fewer parameters than a model comparing individual countries, and hence leaves more residual degrees of freedom for estimating the error variance.
Countries within each region tend to be culturally similar; for example, wearing face masks is common in East Asian countries, but less so in the Americas.
For the purposes of model building, comparing by region instead of by country therefore gives a more parsimonious model with more precisely estimated effects, leading to increased power for tests of significance.
To begin, we conduct a preliminary analysis of the dataset. One important aspect of the data is the number of missing values. To evaluate how many missing values we are dealing with, we use the gg_miss_var function from the naniar package.
Figure 1: Missing Values by Variable
Figure 1 presents a missing-value plot, by variable, for our dataset. As evident from the plot, there are no missing values in the dataset after filtering out the observations with negative-valued indicators. This makes model building and data analysis easier.
Next, it is useful to examine the number of new coronavirus cases by region, in order to understand how the distribution of new cases differs across regions.
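As a rough illustration of the missing-value check, the per-variable counts that gg_miss_var plots can also be computed in base R. The toy data frame below is invented, and deliberately contains missing values so that the counts are non-trivial.

```r
# Toy data with a few deliberately missing entries (invented values).
toy <- data.frame(
  New_cases  = c(10, NA, 30, 40),
  New_deaths = c(0, 1, NA, NA),
  WHO_region = c("Africa", "Europe", "Europe", "Americas")
)

# Number of missing values in each column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# With the naniar package installed, the same counts can be plotted:
# naniar::gg_miss_var(toy)
```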
| Region | New Cases |
|---|---|
| Africa | 7323780 |
| Americas | 12702114 |
| Eastern Mediterranean | 10353758 |
| Europe | 95917321 |
| South-East Asia | 3079866 |
| Western Pacific | 7758647 |
Table 1 shows the number of new cases per region (as designated by the World Health Organization). While it would be easy to conclude that Europe experienced the most coronavirus cases, this summary does not take into account the population residing in each region. To see this, consider the numerical summary presented below:
## WHO_region New_cases Cumulative_cases
## Africa :28473 Min. :-32952 Min. : 0
## Americas :17488 1st Qu.: 0 1st Qu.: 878
## Eastern Mediterranean:13395 Median : 40 Median : 14023
## Europe :30071 Mean : 1346 Mean : 253920
## South-East Asia : 2981 3rd Qu.: 487 3rd Qu.: 137890
## Western Pacific : 9495 Max. :500563 Max. :21127104
## New_deaths Cumulative_deaths
## Min. : -43.00 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 11
## Median : 0.00 Median : 207
## Mean : 15.59 Mean : 4595
## 3rd Qu.: 6.00 3rd Qu.: 2304
## Max. :8786.00 Max. :196818
As evident from the output of the summary function, Europe contributed more observations to this dataset than any other region, while South-East Asia contributed the fewest. This imbalance is to be expected, as we are dealing with observational data. Next, one may notice that the summaries for the New_cases and New_deaths variables have negative minimum values. As stated earlier, these negative numbers arise because countries submitted corrections to the counts reported on previous days. On average, there were \(1,346\) new cases per observation in this dataset, whereas the median number of new cases per observation was only \(40\). This implies that the distribution of new cases is heavily right-skewed. The most probable reason is the difference in population between countries: large countries, such as the United States or China, will (potentially) report a higher number of new cases. Therefore, to obtain a more interpretable response variable, we standardize the number of new cases by the population of the corresponding region.
To re-scale the data, we use the population of each country in a region (as drawn from the World Bank; see footnote 1), and divide each observation’s number of New_cases by the summed population of all countries in its region. For the sake of interpretability, we also multiply these scaled data by a factor of \(100,000\). Finally, to make any future transformations of the data easier, we add a value of 5 to each observation, so that every scaled observation is positive.
\[ Y_{ij}^{(1)} = \frac{Y_{ij}}{W_i} \times 100{,}000 + 5 \\ i = \text{WHO region, } j = 1, 2, \ldots, n_i, \quad W_i = \text{population of the } i\text{th WHO region} \\ \text {Equation 1: Scaling of Response Variable} \]
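A minimal sketch of Equation 1 in R is given below. The regional populations and case counts are hypothetical round numbers chosen so the arithmetic is easy to follow, and `scale_cases` is a helper name introduced only for this illustration.

```r
# Hypothetical regional populations (not the World Bank figures).
region_pop <- c(Africa = 1.0e9, Europe = 8.0e8)

# Invented observations, including a negative correction.
toy <- data.frame(
  WHO_region = c("Africa", "Africa", "Europe", "Europe"),
  New_cases  = c(5000, -2000, 16000, 0)
)

# Equation 1: cases per 100,000 residents of the region, shifted by 5.
scale_cases <- function(cases, region, pop, per = 1e5, shift = 5) {
  unname(cases / pop[region] * per + shift)
}

toy$New_cases_scaled <- scale_cases(toy$New_cases, toy$WHO_region, region_pop)
print(toy)
```

For the first row, 5000 / 1e9 * 100,000 = 0.5, so the scaled value is 5.5; the negative correction in the second row becomes 4.8, which remains positive because of the shift.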
| Region | New Cases (Scaled) |
|---|---|
| Africa | 143098.63 |
| Americas | 88727.19 |
| Eastern Mediterranean | 69079.14 |
| Europe | 162734.17 |
| South-East Asia | 15059.29 |
| Western Pacific | 47895.78 |
As evident from Table 2, there are clear differences in the (scaled) number of new cases between the regions. For instance, Europe and Africa have a much higher (scaled) number of new cases than the other regions. A formal statistical comparison of the (scaled) number of coronavirus cases by region will be conducted in the results section.
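One way such a regional comparison could be carried out is with Tukey’s HSD, which is named among the techniques of this paper. The sketch below runs it on simulated data; the group means, sample sizes, and the names `toy`, `fit`, and `hsd` are assumptions for illustration, and the output says nothing about the real regions.

```r
set.seed(3)

# Simulated scaled case counts for three regions (invented values).
toy <- data.frame(
  region = factor(rep(c("Africa", "Americas", "Europe"), each = 40)),
  cases_scaled = c(rnorm(40, mean = 5.5, sd = 0.3),
                   rnorm(40, mean = 5.1, sd = 0.3),
                   rnorm(40, mean = 6.0, sd = 0.3))
)

# One-way ANOVA followed by all pairwise comparisons of region means.
fit <- aov(cases_scaled ~ region, data = toy)
hsd <- TukeyHSD(fit, "region", conf.level = 0.95)

# Each row holds a pairwise difference, its interval, and adjusted p-value.
print(hsd$region)
```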
We will now proceed to determine the distribution of mask mandates, by WHO_region, as presented in the waffle charts below.
Figure 2: Waffle Chart for Proportions of Mask Mandates by Region
As evident from Figure 2, the regions differed in their mask requirements. For Africa, it was most common for countries to require masks in all public spaces. For Europe, most countries required masks in public areas to some extent. For the Americas, most countries required masks in public areas. For the Western Pacific, most countries had no policy. Finally, the majority of South-East Asia required masks in some public areas.
Next, we shall analyze the distribution of our response variable: the (scaled) number of new cases, using a violin plot.
Figure 3: Violin Plot of (Scaled) New Cases by Region
As evident from Figure 3, there appear to be minor differences in the number of new cases between the regions. This suggests that it would be worth including WHO_region as a factor in our regression model. Next, we view the distribution of the number of new cases with respect to the different facial covering restrictions.
Figure 4: Violin Plot of (Scaled) New Cases by Facial Covering Restrictions
As presented in Figure 4, there appear to be differences in the distributions of new cases by facial covering restriction. Additionally, the variances of these distributions appear to differ substantially, which may present an issue when fitting our model. Next, we analyze the distribution of the number of new cases with respect to the different gathering restrictions.
Figure 5: Violin Plot of (Scaled) New Cases by Gathering Restrictions
As evident from Figure 5, these conditional distributions do not appear to share a similar variance. This may also pose an issue when fitting our model.
Figure 6: Violin Plot of (Scaled) New Cases by School Closure Levels
Figure 7: Violin Plot of (Scaled) New Cases by Event Cancellation Severity
As evident from the violin plots presented in Figure 7, these conditional distributions again do not appear to share a common variance. This suggests that a transformation of the response variable may be needed to satisfy the homoskedasticity assumption.
Finally, we analyze the relationship between the (scaled) number of new deaths and the (scaled) number of new cases.
Figure 8: Scatterplot of New Deaths vs. New Cases
As evident from the scatterplot, there is a curved, roughly parabolic relationship between the (scaled) number of new deaths and the (scaled) number of new cases. This suggests that, taken individually, the relationship between the number of new cases and the number of deaths is non-linear, and this will need to be reflected in our model.
The purpose of this preliminary fit is to assess whether the selected variables (see Background) were significant in determining the number of new coronavirus cases. We now fit a multiple regression model with all of the factors of interest, using the lm function from R’s stats package.
\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i} + \beta_6X_{6i}+\epsilon_i \\ i = 1,2,\ldots,101{,}903, \text{ Where } \epsilon_i \stackrel{i.i.d.}{\sim} Normal(0, \sigma^2), \text{ and:} \\ X_1 = \text{New_deaths, } X_2 = \text{WHO_region, } X_3 = \text{gatherings_restrictions, } \\ X_4 = \text{school_closing, } X_5 = \text{cancel_events, and }X_6 = \text{facial_coverings} \]
##
## Call:
## lm(formula = New_cases_scaled ~ New_deaths_scaled + WHO_region +
## factor(gatherings_restrictions) + factor(school_closing) +
## factor(cancel_events) + facial_coverings, data = covid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.255 -0.185 -0.044 0.057 61.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.877e+02 2.006e+00 -93.575 < 2e-16
## New_deaths_scaled 3.852e+01 4.013e-01 95.984 < 2e-16
## WHO_regionAmericas 5.847e-02 1.044e-02 5.599 2.16e-08
## WHO_regionEastern Mediterranean 1.094e-01 1.136e-02 9.629 < 2e-16
## WHO_regionEurope 3.202e-01 9.745e-03 32.863 < 2e-16
## WHO_regionSouth-East Asia 7.576e-02 2.037e-02 3.720 0.000200
## WHO_regionWestern Pacific 9.078e-02 1.290e-02 7.039 1.95e-12
## factor(gatherings_restrictions)1 1.956e-01 2.219e-02 8.814 < 2e-16
## factor(gatherings_restrictions)2 -2.203e-02 1.539e-02 -1.432 0.152282
## factor(gatherings_restrictions)3 -1.536e-01 1.373e-02 -11.193 < 2e-16
## factor(gatherings_restrictions)4 -2.629e-01 1.452e-02 -18.105 < 2e-16
## factor(school_closing)1 -5.847e-02 1.042e-02 -5.613 2.00e-08
## factor(school_closing)2 -1.207e-01 1.216e-02 -9.925 < 2e-16
## factor(school_closing)3 -1.773e-01 1.158e-02 -15.319 < 2e-16
## factor(cancel_events)1 1.071e-01 1.355e-02 7.903 2.74e-15
## factor(cancel_events)2 2.764e-01 1.426e-02 19.381 < 2e-16
## facial_coveringsRecommended 5.141e-02 1.562e-02 3.292 0.000997
## facial_coveringsRequired Partially 6.503e-02 1.199e-02 5.425 5.82e-08
## facial_coveringsRequired in Public 1.757e-01 1.048e-02 16.767 < 2e-16
## facial_coveringsRequired Out of House 2.196e-01 1.206e-02 18.212 < 2e-16
##
## (Intercept) ***
## New_deaths_scaled ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 101883 degrees of freedom
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.1196
## F-statistic: 729.2 on 19 and 101883 DF, p-value: < 2.2e-16
From the output of the summary function, we can see that (almost) every term in the model is significantly different from 0. However, the value of \(R^2_{adj}\) is very low, suggesting that the model explains little of the variability in the response. To diagnose potential issues, we use a residual plot, a QQ-plot, and the Cook’s Distance criterion. We set the threshold that any observation with a Cook’s distance of 0.1 or greater is considered an influential point.
As evident from the residual plot, the homoskedasticity assumption of the regression appears to be violated. The residual plot shows a fan effect, along with a few notable outliers. There also appears to be a slight curvature in the residuals, which suggests that adding a second-order New_deaths_scaled term may help to capture more of the variance in the number of new coronavirus cases. Additionally, the normality plot shows that the distribution of the residuals is strongly right-skewed. We can also see a few outliers that are influential points, such as observation \(27228\). Before moving on to our next fit, we eliminate the influential points and use the Box-Cox method to find a suitable transformation of the response.
According to the Box-Cox method, the best-suited transformation of the response variable is \(Y^{-2}\). Next, we apply this transformation in our model, with the highly influential points removed.
For our next model, we follow the Box-Cox procedure in transforming the response and, motivated by the curvature in the residuals, include a second-order term for the New_deaths_scaled variable. Additionally, we remove all observations with a Cook’s Distance of \(0.1\) or greater.
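The Cook’s distance rule described above can be sketched on simulated data. Here a single extreme point is planted on purpose; the data, the model `y ~ x`, and the names `toy`, `fit`, and `toy_clean` are illustrative assumptions rather than the model of this paper.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Plant one extreme, high-leverage observation in the last row.
x[n] <- 10
y[n] <- -40
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation; flag those at or above 0.1.
d <- cooks.distance(fit)
influential <- which(d >= 0.1)
print(influential)

# Drop flagged rows with a logical mask, which stays correct even
# when no observation is flagged.
toy_clean <- toy[d < 0.1, ]
```

Subsetting with the logical mask `d < 0.1` rather than with negative indices avoids accidentally returning an empty data frame when nothing exceeds the threshold.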
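A Box-Cox search of this kind could be run with the boxcox function from the MASS package, as in the sketch below. The data are simulated and are not expected to reproduce the exponent reported above; the grid of candidate values and the names `toy`, `fit`, and `lambda_hat` are assumptions for illustration.

```r
library(MASS)  # provides boxcox()

set.seed(42)
# Simulated positive response (Box-Cox requires y > 0).
toy <- data.frame(x = runif(200, min = 1, max = 5))
toy$y <- exp(0.5 + 0.4 * toy$x + rnorm(200, sd = 0.2))

fit <- lm(y ~ x, data = toy)

# Profile log-likelihood over a grid of candidate exponents.
bc <- boxcox(fit, lambda = seq(-3, 3, by = 0.1), plotit = FALSE)

# The exponent with the highest profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

In practice the estimate is usually rounded to a convenient value inside its confidence interval, which is how an exponent such as \(-2\) would be chosen.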
\[ (Y_i)^{-2} = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i} + \beta_6X_{6i}+ \beta_7(X_{1i})^2 + \epsilon_i \\ i = 1,2,\ldots,101{,}901, \text{ Where } \epsilon_i \stackrel{i.i.d.}{\sim} Normal(0, \sigma^2), \text{ and:} \\ X_1 = \text{New_deaths, } X_2 = \text{WHO_region, } X_3 = \text{gatherings_restrictions, } \\ X_4 = \text{school_closing, } X_5 = \text{cancel_events, and }X_6 = \text{facial_coverings} \]
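The structure of this model, a power-transformed response with a second-order covariate term and a categorical factor, can be sketched on simulated data as follows. The variable names `cases`, `deaths`, and `region` and all values are invented; only the form of the formula mirrors the model above.

```r
set.seed(7)
n <- 300

# Simulated covariate and factor (invented values).
toy <- data.frame(
  deaths = runif(n, min = 0, max = 4),
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Strictly positive response, so the inverse-square transform is defined.
toy$cases <- 5 + 0.5 * toy$deaths + 0.3 * toy$deaths^2 +
  0.2 * as.integer(toy$region) + runif(n, min = 0, max = 0.5)

# I() protects the arithmetic so that ^ means exponentiation, not a
# formula operator; the response is transformed as Y^(-2).
fit <- lm(I(cases^(-2)) ~ deaths + I(deaths^2) + region, data = toy)
print(coef(fit))
```

Because the response is inverted, a negative coefficient on the transformed scale corresponds to a larger value on the original scale, which matters when reading the signs in the summary output.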
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ New_deaths_scaled + I(New_deaths_scaled^2) +
## WHO_region + factor(gatherings_restrictions) + factor(school_closing) +
## factor(cancel_events) + facial_coverings, data = covid_outlier_free)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08742 -0.00035 0.00033 0.00103 0.32296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.976e+01 6.098e-01 114.382 < 2e-16
## New_deaths_scaled -2.738e+01 2.419e-01 -113.193 < 2e-16
## I(New_deaths_scaled^2) 2.688e+00 2.399e-02 112.049 < 2e-16
## WHO_regionAmericas -2.797e-04 3.282e-05 -8.524 < 2e-16
## WHO_regionEastern Mediterranean -1.035e-03 3.584e-05 -28.886 < 2e-16
## WHO_regionEurope -1.753e-03 3.086e-05 -56.790 < 2e-16
## WHO_regionSouth-East Asia -5.583e-04 6.398e-05 -8.726 < 2e-16
## WHO_regionWestern Pacific -4.561e-04 4.051e-05 -11.260 < 2e-16
## factor(gatherings_restrictions)1 -3.520e-04 6.971e-05 -5.049 4.45e-07
## factor(gatherings_restrictions)2 2.878e-04 4.834e-05 5.954 2.63e-09
## factor(gatherings_restrictions)3 5.154e-04 4.313e-05 11.951 < 2e-16
## factor(gatherings_restrictions)4 8.397e-04 4.562e-05 18.408 < 2e-16
## factor(school_closing)1 3.951e-04 3.272e-05 12.075 < 2e-16
## factor(school_closing)2 8.533e-04 3.820e-05 22.334 < 2e-16
## factor(school_closing)3 9.042e-04 3.637e-05 24.863 < 2e-16
## factor(cancel_events)1 -5.892e-04 4.258e-05 -13.838 < 2e-16
## factor(cancel_events)2 -1.112e-03 4.484e-05 -24.806 < 2e-16
## facial_coveringsRecommended -5.713e-04 4.906e-05 -11.644 < 2e-16
## facial_coveringsRequired Partially -9.387e-04 3.767e-05 -24.921 < 2e-16
## facial_coveringsRequired in Public -1.076e-03 3.301e-05 -32.587 < 2e-16
## facial_coveringsRequired Out of House -1.215e-03 3.804e-05 -31.927 < 2e-16
##
## (Intercept) ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003256 on 101880 degrees of freedom
## Multiple R-squared: 0.4292, Adjusted R-squared: 0.4291
## F-statistic: 3830 on 20 and 101880 DF, p-value: < 2.2e-16
As evident from the summary output, every term in the model is now statistically significant. Additionally, the value of \(R^2_{adj}\) has increased to \(0.4291\). With this improvement in the model, we now turn to residual analysis to check whether our model assumptions have been met.
As evident from the residual plot, there appear to be multiple outliers present. However, the homoskedasticity assumption appears to be better satisfied. Additionally, the normality of the residuals has improved, but still deviates slightly at the more extreme theoretical quantiles. Next, from the Cook’s Distance plots, we can see that a few outliers are influential points. Therefore, in the hope of better satisfying the homoskedasticity and normality assumptions, we remove these influential points and refit the model.
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ New_deaths_scaled + I(New_deaths_scaled^2) +
## WHO_region + factor(gatherings_restrictions) + factor(school_closing) +
## factor(cancel_events) + facial_coverings, data = covid_without_residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.036820 -0.000344 0.000313 0.000997 0.021564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.694e+01 7.041e-01 123.468 < 2e-16
## New_deaths_scaled -3.421e+01 2.795e-01 -122.413 < 2e-16
## I(New_deaths_scaled^2) 3.366e+00 2.773e-02 121.398 < 2e-16
## WHO_regionAmericas -2.742e-04 3.093e-05 -8.867 < 2e-16
## WHO_regionEastern Mediterranean -9.657e-04 3.381e-05 -28.563 < 2e-16
## WHO_regionEurope -1.690e-03 2.913e-05 -58.009 < 2e-16
## WHO_regionSouth-East Asia -5.451e-04 6.029e-05 -9.042 < 2e-16
## WHO_regionWestern Pacific -4.507e-04 3.817e-05 -11.809 < 2e-16
## factor(gatherings_restrictions)1 -3.392e-04 6.568e-05 -5.164 2.43e-07
## factor(gatherings_restrictions)2 2.751e-04 4.555e-05 6.040 1.55e-09
## factor(gatherings_restrictions)3 4.957e-04 4.064e-05 12.198 < 2e-16
## factor(gatherings_restrictions)4 8.276e-04 4.298e-05 19.254 < 2e-16
## factor(school_closing)1 3.952e-04 3.083e-05 12.817 < 2e-16
## factor(school_closing)2 8.744e-04 3.600e-05 24.290 < 2e-16
## factor(school_closing)3 8.840e-04 3.427e-05 25.799 < 2e-16
## factor(cancel_events)1 -5.634e-04 4.012e-05 -14.042 < 2e-16
## factor(cancel_events)2 -1.071e-03 4.227e-05 -25.328 < 2e-16
## facial_coveringsRecommended -5.493e-04 4.624e-05 -11.880 < 2e-16
## facial_coveringsRequired Partially -9.288e-04 3.549e-05 -26.170 < 2e-16
## facial_coveringsRequired in Public -1.040e-03 3.113e-05 -33.406 < 2e-16
## facial_coveringsRequired Out of House -1.146e-03 3.588e-05 -31.939 < 2e-16
##
## (Intercept) ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003068 on 101877 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4633
## F-statistic: 4400 on 20 and 101877 DF, p-value: < 2.2e-16
## 90525 90527 90528 90529
## 90525 90527 90528 90529
As evident from the leverage plots, there are still a few influential points present in the data. Therefore, we shall refit the model with these influential points removed.
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ New_deaths_scaled + I(New_deaths_scaled^2) +
## WHO_region + factor(gatherings_restrictions) + factor(school_closing) +
## factor(cancel_events) + facial_coverings, data = covid_without_residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.036771 -0.000339 0.000312 0.000989 0.020784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.168e+01 7.334e-01 125.003 < 2e-16
## New_deaths_scaled -3.610e+01 2.911e-01 -123.982 < 2e-16
## I(New_deaths_scaled^2) 3.554e+00 2.889e-02 122.999 < 2e-16
## WHO_regionAmericas -2.697e-04 3.085e-05 -8.740 < 2e-16
## WHO_regionEastern Mediterranean -9.494e-04 3.374e-05 -28.144 < 2e-16
## WHO_regionEurope -1.673e-03 2.906e-05 -57.568 < 2e-16
## WHO_regionSouth-East Asia -5.405e-04 6.014e-05 -8.987 < 2e-16
## WHO_regionWestern Pacific -4.470e-04 3.808e-05 -11.739 < 2e-16
## factor(gatherings_restrictions)1 -3.362e-04 6.552e-05 -5.131 2.89e-07
## factor(gatherings_restrictions)2 2.719e-04 4.544e-05 5.984 2.18e-09
## factor(gatherings_restrictions)3 4.919e-04 4.054e-05 12.136 < 2e-16
## factor(gatherings_restrictions)4 8.218e-04 4.288e-05 19.166 < 2e-16
## factor(school_closing)1 3.936e-04 3.076e-05 12.798 < 2e-16
## factor(school_closing)2 8.751e-04 3.591e-05 24.369 < 2e-16
## factor(school_closing)3 8.837e-04 3.418e-05 25.855 < 2e-16
## factor(cancel_events)1 -5.598e-04 4.002e-05 -13.988 < 2e-16
## factor(cancel_events)2 -1.063e-03 4.216e-05 -25.207 < 2e-16
## facial_coveringsRecommended -5.456e-04 4.612e-05 -11.829 < 2e-16
## facial_coveringsRequired Partially -9.247e-04 3.540e-05 -26.117 < 2e-16
## facial_coveringsRequired in Public -1.025e-03 3.106e-05 -33.001 < 2e-16
## facial_coveringsRequired Out of House -1.128e-03 3.581e-05 -31.514 < 2e-16
##
## (Intercept) ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00306 on 101873 degrees of freedom
## Multiple R-squared: 0.4651, Adjusted R-squared: 0.465
## F-statistic: 4430 on 20 and 101873 DF, p-value: < 2.2e-16
As evident from the summary output, the \(R^2_{adj}\) has increased to 0.465 (46.5%). We now analyze the residuals once more.
As evident from the residual plots, it appears that the homoskedasticity assumption may still be violated. Additionally, the normality assumption of the residuals may be violated. Therefore, we will consider another Box-Cox transformation. However, due to the large sample size of this dataset, we believe that we may be more lenient with these assumptions.
Determining whether to use Likelihood Ratio Tests, or continue with F-tests. Awaiting confirmation from Miss Zitong.
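For reference, the two candidate tests can be run side by side on a pair of nested linear models. The sketch below uses simulated data, with the names `reduced`, `full`, and `toy` introduced only for illustration; it is not a statement of which test this paper adopts.

```r
set.seed(11)
n <- 200

# Simulated data: one covariate and one three-level factor.
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- 1 + 2 * toy$x + ifelse(toy$g == "c", 0.8, 0) + rnorm(n)

reduced <- lm(y ~ x, data = toy)      # model without the factor
full    <- lm(y ~ x + g, data = toy)  # model with the factor

# Partial F-test comparing the nested models.
f_test <- anova(reduced, full)
print(f_test)

# Likelihood ratio test: twice the log-likelihood gain, referred to a
# chi-square distribution with df equal to the number of added parameters.
lr_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))
df_diff <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
lr_p    <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = lr_stat, df = df_diff, p_value = lr_p))
```

With a sample as large as the one analyzed here, the two tests would be expected to give very similar conclusions, since the chi-square approximation of the likelihood ratio statistic improves with sample size.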
In this dataset, causal inference is not feasible. The reason is that the Stable Unit Treatment Value Assumption (SUTVA) is violated. One of the clearest violations of SUTVA lies in the facial_coverings variable. Some people used N95 masks as face coverings throughout the pandemic, while others opted for less protective options, such as face shields or neck gaiters. Since the treatment actually received varies with each individual’s personal preferences, there are multiple versions of the same treatment level, and we cannot make causal inference. Additionally, we cannot rely on the ignorability assumption, as there are no covariates conditional on which we can assume that the assignment of our factors is independent of the potential outcomes. Therefore, we are left to suffer the drawbacks associated with a lack of randomization.
Overall, we can conclude that the variables facial_coverings, cancel_events, WHO_region, gatherings_restrictions, New_deaths_scaled, and school_closing were each significantly associated with the number of new cases.
https://ourworldindata.org/coronavirus-testing
https://data.worldbank.org/indicator/SP.POP.TOTL
Most countries’ populations are based on data from the year 2020. However, some countries, such as Eritrea, have not conducted an official census as recently, and thus their populations are represented by 2011 estimates.